Handle Metal OOM gracefully in mlx_lm.server with structured errors #1034
Aristide021 wants to merge 4 commits into ml-explore:main
Classify generation failures in mlx_lm.server and return structured errors instead of crashing or misreporting them as 404.

- Detect Metal/MLX OOM errors and map them to HTTP 503
- Map other generation exceptions to HTTP 500
- Return structured JSON error payloads for non-stream responses
- Emit a terminal SSE error event + [DONE] for stream responses
- Keep the server alive after OOM
- Defer non-stream 200 headers until the success response is ready
- Add OOM regression tests (stream + non-stream) in test_server.py
- Document OOM behavior and mitigation knobs in SERVER.md
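The classification and the deferred-200 behavior described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function names, the marker strings, and the JSON field names (`error`, `type`, `message`) are all assumptions made for the example.

```python
import json

# Hypothetical substrings treated as OOM indicators (assumption, not the PR's list).
OOM_MARKERS = ("out of memory", "insufficient memory")


def classify_generation_error(exc):
    """Map a generation exception to (http_status, json_body)."""
    message = str(exc)
    if any(marker in message.lower() for marker in OOM_MARKERS):
        status, err_type = 503, "out_of_memory"
    else:
        status, err_type = 500, "generation_error"
    # Field names here are illustrative only.
    body = json.dumps({"error": {"type": err_type, "message": message}})
    return status, body


def handle_non_stream(generate):
    """Run generation first and choose the status only once the outcome is known.

    Because nothing is written before `generate()` returns, a failure can
    still be reported as 503/500 instead of a 200 that was already sent.
    """
    try:
        text = generate()
    except Exception as exc:  # keep the server alive on any generation failure
        return classify_generation_error(exc)
    return 200, json.dumps({"text": text})
```

In this sketch a handler would call `handle_non_stream(...)` and only then write the status line and body, so a failing request never leaves a half-written success response behind.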
The OOM detection markers look correct for Apple Silicon. The main paths MLX raises on unified memory exhaustion are:
One gap: when
The deferred-200 pattern for non-streaming is a good fix. The streaming error path handling (pre-stream vs mid-stream headers) is also correct.
Minor: the error response includes
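To make the pre-stream vs mid-stream distinction concrete, here is a small sketch of one way a streaming path could behave. It assumes that a failure before the first token can still choose an error status, while a failure after the first token must be reported inside the stream because 200 has already gone out. The helper names and payload fields are hypothetical, and frames are collected in a list only to keep the example easy to inspect; a real server would write them as they are produced.

```python
import json


def sse(data):
    """Format one server-sent event frame."""
    return f"data: {data}\n\n"


def run_stream(token_iter):
    """Return (status, frames) for a streaming generation.

    The status is fixed when the first token arrives (headers sent) or when
    generation fails before any token was produced.
    """
    frames = []
    status = None
    try:
        for token in token_iter:
            if status is None:
                status = 200  # headers would be sent together with the first token
            frames.append(sse(json.dumps({"text": token})))
    except Exception as exc:
        if status is None:
            # Pre-stream failure: an error status is still possible.
            status = 503 if "memory" in str(exc).lower() else 500
        # Pre-stream or mid-stream, emit a terminal error event.
        frames.append(sse(json.dumps({"error": {"message": str(exc)}})))
    if status is None:
        status = 200  # empty but successful generation
    frames.append(sse("[DONE]"))
    return status, frames
```

Either way the stream ends with an error event followed by `[DONE]`, so a client that waits for `[DONE]` does not hang.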
- Add marker coverage for 'attempting to allocate' and 'maximum allowed buffer size'
- Add regression test to ensure these variants map to HTTP 503
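A marker check covering these variants might look like the sketch below. The marker strings are the ones quoted in this thread; the function names, the case-insensitive substring matching, and the sample error messages are assumptions for illustration and may differ from what the server actually does.

```python
# Marker substrings quoted in this PR; the matching strategy is an assumption.
OOM_MARKERS = (
    "insufficient memory for buffer",
    "attempting to allocate",
    "maximum allowed buffer size",
)


def is_oom_error(exc):
    """Return True if the exception message contains a known OOM marker."""
    text = str(exc).lower()
    return any(marker in text for marker in OOM_MARKERS)


def status_for(exc):
    """OOM maps to 503 (temporarily unavailable); anything else to 500."""
    return 503 if is_oom_error(exc) else 500


# Regression-style check: every marker variant should map to 503.
# The messages are made-up examples, not captured MLX output.
samples = [
    RuntimeError("Insufficient memory for buffer"),
    RuntimeError("Attempting to allocate 123 bytes"),
    RuntimeError("greater than the maximum allowed buffer size"),
]
for sample in samples:
    assert status_for(sample) == 503
assert status_for(ValueError("unrelated failure")) == 500
```

Substring matching is brittle if upstream error text changes, which is presumably why the regression test pins each variant.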
Closes #854
Refs #1015
Aware of #948 (broader memory controls); this PR is intentionally scoped to crash-to-response handling and can merge independently.